Mastering Data Visualization with R

Author

Martin Schweinberger

Welcome!

What You’ll Learn

By the end of this tutorial, you will be able to:

  • Choose the right visualization type for your data and research question
  • Create publication-quality plots using ggplot2
  • Customize visualizations to tell compelling data stories
  • Apply best practices for effective data communication
  • Build complex, multi-layered visualizations step-by-step

Who This Tutorial Is For

This tutorial is designed for:

  • Beginners who want to learn data visualization from scratch
  • Intermediate R users looking to enhance their plotting skills
  • Researchers who need to create professional visualizations for publications
  • Anyone interested in telling stories with data

Prerequisites

Tutorial Structure

This tutorial follows a learn-by-doing approach with three main components:

  1. Concept explanations - Understanding when and why to use each visualization
  2. Step-by-step examples - Building plots from simple to complex
  3. Hands-on exercises - Practice what you’ve learned immediately
Learning Philosophy

Rather than showing you every possible option at once, we’ll build complexity gradually. Each section introduces new concepts that build on what you’ve learned before.

Setup and Preparation

Installing Required Packages

First, let’s install all the packages we’ll need. Run this code once - it may take 3-5 minutes:

Code
# Install core packages
install.packages("dplyr")      # Data manipulation
install.packages("stringr")    # String processing
install.packages("ggplot2")    # Core plotting package
install.packages("tidyr")      # Data reshaping
install.packages("scales")     # Scale functions for ggplot2

# Install specialized plotting packages
install.packages("ggridges")   # Ridge plots
install.packages("ggstats")    # Statistical plots
install.packages("ggstatsplot")# Statistical visualizations
install.packages("EnvStats")   # Environmental statistics

# Install packages for specific plot types
install.packages("likert")     # Likert scale visualizations
install.packages("vcd")        # Categorical data visualization
install.packages("hexbin")     # Hexagonal binning
install.packages("gridExtra")  # Arranging multiple plots
install.packages("quanteda")   # Text processing (used in Part 4)
install.packages("quanteda.textplots")  # Word clouds (used in Part 4)

# Install utility packages
install.packages("flextable")  # Pretty tables
install.packages("devtools")   # For installing from GitHub

# Install ggflags from GitHub (for country flags in plots)
devtools::install_github("jimjam-slam/ggflags")

Loading Packages

Now activate the packages for this session:

Code
library(dplyr)
library(stringr)
library(ggplot2)
library(tidyr)
library(flextable)
library(hexbin)
library(gridExtra)
library(ggflags)
library(ggstats)
library(ggridges)
library(EnvStats)
library(scales)
Pro Tip

Create a standard R script with these library calls that you can run at the start of each data visualization session!
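For example, such a script (here called setup.R, a name of our own choosing) could look like this:

Code

```r
# setup.R -- packages used throughout this tutorial
library(dplyr)      # data manipulation
library(stringr)    # string processing
library(ggplot2)    # plotting
library(tidyr)      # reshaping
library(scales)     # axis and scale formatting

# Then start each session or analysis script with:
# source("setup.R")
```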

Loading the Data

We’ll work with a dataset about preposition usage in historical English texts:

Code
# Load data
pdat <- base::readRDS("tutorials/dviz/data/pvd.rda")

Let’s examine the structure of our data:

Date   Genre           Text         Prepositions   Region   GenreRedux       DateRedux
1736   Science         albin        166.01         North    NonFiction       1700-1799
1711   Education       anon         139.86         North    NonFiction       1700-1799
1808   PrivateLetter   austen       130.78         North    Conversational   1800-1913
1878   Education       bain         151.29         North    NonFiction       1800-1913
1743   Education       barclay      145.72         North    NonFiction       1700-1799
1908   Education       benson       120.77         North    NonFiction       1800-1913
1906   Diary           benson       119.17         North    Conversational   1800-1913
1897   Philosophy      boethja      132.96         North    NonFiction       1800-1913
1785   Philosophy      boethri      130.49         North    NonFiction       1700-1799
1776   Diary           boswell      135.94         North    Conversational   1700-1799
1905   Travel          bradley      154.20         North    NonFiction       1800-1913
1711   Education       brightland   149.14         North    NonFiction       1700-1799
1762   Sermon          burton       159.71         North    Religious        1700-1799
1726   Sermon          butler       157.49         North    Religious        1700-1799
1835   PrivateLetter   carlyle      124.16         North    Conversational   1800-1913

Understanding Our Data

Our dataset contains:

  • Date: When the text was written
  • Genre: Type of text (Fiction, Legal, Religious, etc.)
  • Text: Name of the source text
  • Prepositions: Relative frequency of prepositions (per 1,000 words)
  • Region: Geographic location (North/South)
  • GenreRedux: Simplified genre categories
  • DateRedux: Time periods (1150-1499, 1500-1599, etc.)

Setting Up a Color Palette

Let’s create a consistent color scheme for our visualizations:

Code
# Define custom colors
clrs <- c("purple", "gray80", "lightblue", "orange", "gray30")
Why Custom Colors?

Using a consistent color palette across all your visualizations:
- Creates a professional, cohesive look
- Makes your work more recognizable
- Ensures color accessibility
- Saves time (no need to specify colors each time)

Explore more color options:
- R Color Reference
- R Color Palettes
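The time savings become concrete if you wrap the palette in small helper functions. The names scale_color_tutorial() and scale_fill_tutorial() below are our own invention, not part of ggplot2:

Code

```r
library(ggplot2)

clrs <- c("purple", "gray80", "lightblue", "orange", "gray30")

# Hypothetical helpers that apply the tutorial palette in one call
scale_color_tutorial <- function(...) scale_color_manual(values = clrs, ...)
scale_fill_tutorial  <- function(...) scale_fill_manual(values = clrs, ...)
```

Any plot can then end with + scale_color_tutorial(name = "Genre") instead of repeating the color vector.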


Part 1: Exploring Relationships

In this section, we’ll learn to visualize relationships between variables. We’ll start simple and gradually add complexity.

Scatter Plots: The Foundation

When to use scatter plots: To show the relationship between two continuous (numeric) variables.

Research questions answered:
- Is there a relationship between X and Y?
- Does the relationship vary by group?
- Are there outliers or unusual patterns?

Building Your First Scatter Plot

Let’s create a basic scatter plot step by step:

Code
# Step 1: Most basic scatter plot
ggplot(data = pdat,                    # Our dataset
       aes(x = Date,                   # X-axis variable
           y = Prepositions)) +        # Y-axis variable
  geom_point()                         # Add points

Understanding the Code
  • ggplot(): Initialize the plot
  • aes(): Define “aesthetics” (what goes where)
  • geom_point(): Add a layer of points
  • +: Add layers together (like building blocks!)

Exercise 1.1: Your First Plot

Try It Yourself!

Create a scatter plot showing the relationship between Date (x-axis) and Prepositions (y-axis) using the code above.

Questions to consider:
1. What pattern do you see?
2. Are prepositions becoming more or less frequent over time?
3. Is the relationship linear or does it curve?

Adding Color: Visualizing Groups

Now let’s add color to distinguish between genres:

Code
ggplot(pdat,
       aes(x = Date,
           y = Prepositions,
           color = GenreRedux)) +        # Color by genre
  geom_point() +
  theme_bw()                             # Clean black & white theme

What changed?
- color = GenreRedux inside aes() colors points by genre
- theme_bw() gives us a cleaner, professional look
- ggplot2 automatically creates a legend!

Customizing Colors and Shapes

Let’s make our plot publication-ready:

Code
ggplot(pdat, 
       aes(Date, Prepositions, 
           color = GenreRedux, 
           shape = GenreRedux)) +          # Different shapes for genres
  geom_point(size = 2) +                   # Larger points
  scale_shape_manual(
    name = "Genre",
    values = 1:5                           # Different point shapes
  ) +
  scale_color_manual(
    name = "Genre",
    values = clrs                          # Our custom colors
  ) +
  theme_bw() +
  theme(legend.position = "top")           # Move legend to top

Design Principle: Redundant Encoding

Using both color AND shape to show genre makes your plot more accessible:
- People with color blindness can use shapes
- Black & white printing preserves information
- Easier to distinguish groups when many overlap

Exercise 1.2: Customize Your Plot

Challenge

Modify the plot above to:
1. Change the theme to theme_minimal() or theme_classic()
2. Move the legend to the bottom
3. Try different point sizes (hint: change the size parameter)

Bonus: Try theme_void() - what happens? Why might this be useful (or not)?

Adding Statistical Layers

Trend Lines: Seeing Patterns

Let’s add trend lines to see patterns more clearly:

Code
ggplot(pdat, aes(Date, Prepositions, color = Genre)) +
  facet_wrap(vars(Genre), ncol = 4) +    # Separate panel per genre
  geom_point(alpha = 0.5) +              # Semi-transparent points
  geom_smooth(method = "lm", se = FALSE) + # Linear trend line
  theme_bw() +
  theme(
    legend.position = "none",             # No legend needed (titles show genre)
    axis.text.x = element_text(size = 8, angle = 90)
  )

New concepts:
- facet_wrap(): Create separate panels for each group
- alpha = 0.5: Make points semi-transparent (50% opacity)
- geom_smooth(): Add a smoothed trend line
- method = "lm": Use linear regression
- se = FALSE: Don’t show confidence interval

When to Use Facets

Facets (separate panels) work best when:
- You have 3-8 groups to compare
- Patterns within groups are important
- Overlapping points make one plot hard to read

Avoid facets when:
- You need to directly compare values across groups
- You have too many groups (>10)

Density Overlays: Alternative to Points

Sometimes you have too many overlapping points. Here’s an alternative:

Code
ggplot(pdat, aes(x = Date, y = Prepositions, color = GenreRedux)) +
  facet_wrap(vars(GenreRedux), ncol = 5) +
  geom_density_2d() +                    # 2D density contours
  theme_bw() +
  theme(
    legend.position = "none",
    axis.text.x = element_text(size = 8, angle = 90)
  )

What are density contours? Think of them like topographic map lines - they show where data points are concentrated.

Quick Comparison Table

Visualization      Best For                                 Limitations
Points             Small-medium datasets, seeing all data   Gets messy with many points
Trend lines        Showing overall patterns                 Hides individual variation
Density contours   Large datasets, concentration patterns   Harder to interpret
Hex bins (next!)   Very large datasets                      Requires uniform X-Y scales

Hex Plots: Handling Big Data

When you have thousands of points, hex plots show density efficiently:

Code
pdat |>
  ggplot(aes(x = Date, y = Prepositions)) +
  geom_hex() +                          # Hexagonal binning
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  theme_bw()

Darker hexagons = more data points in that region.

Exercise 1.4: Comparing Approaches

Synthesis Challenge

Create three plots of the same data:
1. A scatter plot with geom_point()
2. A density plot with geom_density_2d()
3. A hex plot with geom_hex()

Reflect:
- What different insights does each provide?
- Which would you use in a paper? A presentation? An exploratory analysis?


Part 2: Showing Distributions

Understanding distributions helps us see patterns, outliers, and the “shape” of our data.

Density Plots: Smooth Distribution Curves

When to use: To show how values are distributed, especially comparing groups.

Code
ggplot(pdat, aes(Date, fill = Region)) +
  geom_density(alpha = 0.5) +           # Semi-transparent densities
  scale_fill_manual(values = clrs[1:2]) +
  theme_bw() +
  theme(legend.position = c(0.1, 0.9))  # Position inside plot area

Reading density plots:
- X-axis: Values of the variable (Date)
- Y-axis: Density (higher = more data points)
- Peaks: Most common values
- Width: Spread of the data

Interpreting This Plot

The plot shows that:
- Southern texts continue into the 1800s
- Northern texts end around 1700
- There’s an overlap period where both regions produced texts

Exercise 2.1: Distribution Detective

Investigation

Create a density plot of Prepositions (not Date), colored by GenreRedux.

Questions:
1. Which genre has the highest average preposition frequency?
2. Which genre shows the most variation (widest distribution)?
3. Do any genres have unusual distributions (multiple peaks, asymmetry)?

Histograms: Counting in Bins

Histograms are similar to density plots but show actual counts:

Code
ggplot(pdat, aes(Prepositions)) +
  geom_histogram(bins = 30,              # Number of bins
                 fill = "steelblue",
                 color = "white") +      # Outline color
  theme_bw() +
  labs(title = "Distribution of Preposition Frequencies",
       x = "Prepositions per 1,000 words",
       y = "Count")

Comparing Groups with Histograms

Code
ggplot(pdat, aes(Prepositions, fill = Region)) +
  geom_histogram(bins = 30, alpha = 0.6, position = "identity") +
  scale_fill_manual(values = clrs[1:2]) +
  theme_bw() +
  theme(legend.position = "top")

Histogram vs. Bar Plot

Don’t confuse these!
- Histogram: Shows distribution of ONE continuous variable (bins are ranges)
- Bar plot: Shows counts/values for CATEGORIES (bars are discrete groups)
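The difference in a minimal sketch (a toy data frame stands in for pdat; the plots are built but not printed):

Code

```r
library(ggplot2)

# Toy stand-in for pdat so the sketch runs on its own
toy <- data.frame(
  Prepositions = c(120, 131, 139, 145, 151, 157, 160, 166),
  Genre        = c("Fiction", "Fiction", "NonFiction", "Religious",
                   "NonFiction", "Legal", "Legal", "NonFiction")
)

# Histogram: bins ONE continuous variable into value ranges
p_hist <- ggplot(toy, aes(Prepositions)) + geom_histogram(bins = 4)

# Bar plot: counts occurrences of each discrete category
p_bar <- ggplot(toy, aes(Genre)) + geom_bar()
```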

Exercise 2.2: Finding the Right Bin Width

Experiment

Create three histograms of Prepositions with different numbers of bins:
1. bins = 10
2. bins = 30
3. bins = 100

Discuss:
- Too few bins: What information is lost?
- Too many bins: What problems arise?
- How do you choose the “right” number?

Hint: ggplot2 defaults to bins = 30, which is often a reasonable starting point. For a data-driven choice, the Freedman-Diaconis rule sets the bin width to 2 × IQR / n^(1/3).
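Base R implements the Freedman-Diaconis rule as grDevices::nclass.FD(); here is a sketch on simulated data standing in for pdat$Prepositions:

Code

```r
set.seed(42)
x <- rnorm(537, mean = 140, sd = 15)   # stand-in for pdat$Prepositions

# Freedman-Diaconis: bin width = 2 * IQR / n^(1/3)
n_bins <- grDevices::nclass.FD(x)
n_bins   # pass this to geom_histogram(bins = n_bins)
```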

Ridge Plots: Beautiful Distribution Comparisons

Ridge plots elegantly show multiple distributions:

Code
library(ggridges)

pdat |>
  ggplot(aes(x = Prepositions, y = GenreRedux, fill = GenreRedux)) +
  geom_density_ridges() +
  theme_ridges() +
  theme(legend.position = "none") +
  labs(y = "", 
       x = "Relative frequency of prepositions")

Why ridge plots are great:
- Easy to compare shapes across many groups
- Aesthetically pleasing
- Popular in modern data visualization

Exercise 2.3: Ridge Plot Exploration

Create and Customize
  1. Create a ridge plot of Prepositions by DateRedux (instead of GenreRedux)
  2. Add color with scale_fill_manual(values = clrs)
  3. Try geom_density_ridges(alpha = 0.6, stat = "binline", bins = 20) - what changes?

Bonus: Research what stat = "binline" does. Why might you choose this over smooth densities?

Boxplots: The Statistical Summary

Boxplots show five key statistics at once:

Code
ggplot(pdat, aes(DateRedux, Prepositions, fill = DateRedux)) +
  geom_boxplot() +
  scale_fill_manual(values = clrs) +
  theme_bw() +
  theme(legend.position = "none") +
  labs(x = "Time Period", 
       y = "Prepositions (per 1,000 words)")

Reading a Boxplot

[Figure: anatomy of a boxplot, showing median, quartiles, whiskers, and outliers]

  • Line in box: Median (50th percentile)
  • Box: Interquartile range (IQR) - middle 50% of data
  • Whiskers: Extend to the most extreme data points within 1.5 × IQR of the box
  • Dots: Outliers beyond whiskers
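These statistics can be checked directly in base R. A sketch on simulated data: stats::fivenum() returns Tukey's five numbers, which is essentially what geom_boxplot() draws:

Code

```r
set.seed(1)
x <- c(rnorm(100, mean = 140, sd = 12), 200)   # one artificial outlier

fivenum(x)   # minimum, lower hinge, median, upper hinge, maximum

# Whisker reach under the 1.5 * IQR convention
iqr   <- IQR(x)
lower <- quantile(x, 0.25) - 1.5 * iqr
upper <- quantile(x, 0.75) + 1.5 * iqr
x[x < lower | x > upper]   # values drawn as outlier dots
```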

Notched Boxplots: Testing Differences

Code
ggplot(pdat, aes(DateRedux, Prepositions, fill = DateRedux)) +
  geom_boxplot(notch = TRUE,             # Add notches
               outlier.colour = "red",
               outlier.shape = 2,
               outlier.size = 3) +
  scale_fill_manual(values = clrs) +
  theme_bw() +
  theme(legend.position = "none")

The Notch Test

If the notches of two boxes don’t overlap → strong evidence that the group medians differ.

This is a visual “rough test” - not a replacement for proper statistics!
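There is a simple formula behind the notch: ggplot2 draws it at roughly median ± 1.58 × IQR / √n. A base-R sketch on simulated data:

Code

```r
set.seed(7)
x <- rnorm(80, mean = 140, sd = 12)   # simulated preposition frequencies

m    <- median(x)
half <- 1.58 * IQR(x) / sqrt(length(x))   # half-width of the notch
c(lower = m - half, upper = m + half)     # approximate notch limits
```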

Enhanced Boxplots with Individual Points

Code
library(EnvStats)

ggplot(pdat, aes(DateRedux, Prepositions, fill = DateRedux, color = DateRedux)) +
  geom_boxplot(varwidth = TRUE,          # Width proportional to sample size
               color = "black", 
               alpha = 0.3) +
  geom_jitter(alpha = 0.3,               # Add individual points
              height = 0,                 # Don't jitter vertically
              width = 0.2) +              # Small horizontal spread
  facet_grid(~Region) +
  EnvStats::stat_n_text(y.pos = 65) +    # Add sample sizes
  theme_bw() +
  theme(legend.position = "none") +
  labs(x = "", 
       y = "Frequency (per 1,000 words)",
       title = "Preposition Use Across Time and Regions")

Exercise 2.4: Boxplot Mastery

Advanced Challenge
  1. Create a boxplot of Prepositions by GenreRedux
  2. Add notches
  3. Add jittered points
  4. Color by genre
  5. Add appropriate labels

Analysis questions:
- Which genres show the most variation?
- Are there any outliers? What might they represent?
- Do any genre pairs show non-overlapping notches?

Violin Plots: Best of Both Worlds

Violin plots combine boxplot statistics with density shapes:

Code
ggplot(pdat, aes(DateRedux, Prepositions, fill = DateRedux)) +
  geom_violin(trim = FALSE, alpha = 0.5) +
  scale_fill_manual(values = clrs) +
  theme_bw() +
  theme(legend.position = "none")

Violin plots show:
- Distribution shape (like density plots)
- Median and quartiles (like boxplots)
- Multimodal distributions (multiple peaks)

When to Choose Each Plot Type

Plot Type   Best For                                   Avoid When
Histogram   Single variable, showing counts            Comparing many groups
Density     Smooth distributions, comparisons          Need exact counts
Ridge       Many groups, emphasis on shapes            Fewer than 3 groups
Boxplot     Statistical summary, outliers              Distribution shape matters
Violin      Shape + summary, detecting multimodality   Small sample sizes

Exercise 2.5: Distribution Showdown

Comparative Analysis

For the variable Prepositions grouped by GenreRedux, create:
1. A ridge plot
2. A boxplot
3. A violin plot

Reflection:
- What does each reveal that the others don’t?
- If you could only show ONE plot in a paper, which would you choose and why?
- How does sample size affect each plot type?


Part 3: Categorical Data

Working with categorical variables requires different approaches. Let’s explore the options!

Bar Plots: The Workhorse of Categories

First, let’s create summary data:

Code
bdat <- pdat |>
  dplyr::mutate(DateRedux = factor(DateRedux)) |>
  group_by(DateRedux) |>
  dplyr::summarise(Frequency = n()) |>
  dplyr::mutate(Percent = round(Frequency / sum(Frequency) * 100, 1))

# View the data
bdat
# A tibble: 5 × 3
  DateRedux Frequency Percent
  <fct>         <int>   <dbl>
1 1150-1499        34     6.3
2 1500-1599       180    33.5
3 1600-1699       225    41.9
4 1700-1799        53     9.9
5 1800-1913        45     8.4

Basic Bar Plot

Code
ggplot(bdat, aes(DateRedux, Percent, fill = DateRedux)) +
  geom_bar(stat = "identity") +          # Use actual values
  geom_text(aes(y = Percent - 3,         # Position labels
                label = paste0(Percent, "%")), 
            color = "white", 
            size = 4) +
  scale_fill_manual(values = clrs) +
  theme_bw() +
  theme(legend.position = "none") +
  labs(x = "Time Period",
       y = "Percentage of Documents",
       title = "Distribution of Texts Across Time Periods")

stat = "identity" Explained
  • geom_bar() by default counts occurrences (stat = "count")
  • Use stat = "identity" when your data already contains the values to plot
  • Think: “plot the values AS IS (their identity)”
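Note that geom_col() is ggplot2's built-in shorthand for geom_bar(stat = "identity"); the two plots below are identical:

Code

```r
library(ggplot2)

counts <- data.frame(Period  = c("A", "B", "C"),
                     Percent = c(20, 50, 30))

p1 <- ggplot(counts, aes(Period, Percent)) + geom_bar(stat = "identity")
p2 <- ggplot(counts, aes(Period, Percent)) + geom_col()   # same result
```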

Grouped Bar Plots

Code
ggplot(pdat, aes(Region, fill = DateRedux)) +
  geom_bar(position = position_dodge(),  # Side-by-side bars
           stat = "count") +
  scale_fill_manual(values = clrs) +
  theme_bw() +
  labs(x = "Region",
       y = "Number of Documents",
       fill = "Time Period")

When to use grouped bars:
- Comparing sub-categories within main categories
- 2-3 sub-groups work best
- Direct comparison between groups is important

Stacked Bar Plots

Code
ggplot(pdat, aes(DateRedux, fill = GenreRedux)) +
  geom_bar(stat = "count") +
  scale_fill_manual(values = clrs) +
  theme_bw() +
  labs(x = "Time Period",
       y = "Number of Documents",
       fill = "Genre",
       title = "Genre Composition Across Time Periods")

Normalized Stacked Bars (100%)

Code
ggplot(pdat, aes(DateRedux, fill = GenreRedux)) +
  geom_bar(stat = "count", position = "fill") +
  scale_fill_manual(values = clrs) +
  scale_y_continuous(labels = scales::percent) +  # Format as percentages
  theme_bw() +
  labs(x = "Time Period",
       y = "Proportion of Documents",
       fill = "Genre",
       title = "Relative Genre Composition Over Time")

Choosing Bar Plot Types

Grouped bars when:
- Comparing specific values across groups
- You have 2-3 subgroups
- Actual counts matter

Stacked bars when:
- Showing composition (parts of a whole)
- Total amount is important
- You have 3-6 subgroups

100% stacked when:
- Only proportions matter (not absolute values)
- Emphasizing compositional changes

Exercise 3.1: Bar Plot Practice

Build Your Skills
  1. Create a grouped bar plot showing GenreRedux by Region
  2. Create a stacked bar plot of the same data
  3. Create a 100% stacked version

Questions:
- Which plot makes it easiest to compare genre frequencies between regions?
- Which shows total document counts best?
- What story does the 100% stacked version tell?

Likert Scale Visualizations

Survey data with Likert scales (Strongly Disagree → Strongly Agree) needs special treatment.

First, let’s load some survey data:

Code
ldat <- base::readRDS("tutorials/dviz/data/lid.rda")
head(ldat)
   Course Satisfaction
1 Chinese            1
2 Chinese            1
3 Chinese            1
4 Chinese            1
5 Chinese            1
6 Chinese            1

Method 1: Grouped Bar Plot

Code
# Summarize the data
nlik <- ldat |>
  dplyr::group_by(Course, Satisfaction) |>
  dplyr::summarize(Frequency = n())

# Create grouped bar plot
ggplot(nlik, aes(Satisfaction, Frequency, fill = Course)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  scale_fill_manual(values = clrs[1:3]) +
  geom_text(aes(label = Frequency),
            vjust = 1.6, color = "white",
            position = position_dodge(0.9), size = 3.5) +
  scale_x_continuous(
    breaks = 1:5,
    labels = c("Very\nDissatisfied", "Dissatisfied", 
               "Neutral", "Satisfied", "Very\nSatisfied")
  ) +
  theme_bw() +
  labs(title = "Student Satisfaction by Course",
       x = "Satisfaction Level",
       y = "Number of Students")

Method 2: Cumulative Line Graph

Code
ggplot(ldat, aes(x = Satisfaction, color = Course)) +
  stat_ecdf(geom = "step", linewidth = 1.5) +
  scale_colour_manual(values = clrs[1:3]) +
  scale_x_continuous(
    breaks = 1:5,
    labels = c("Very\nDissatisfied", "Dissatisfied", 
               "Neutral", "Satisfied", "Very\nSatisfied")
  ) +
  theme_bw() +
  labs(title = "Cumulative Satisfaction Distribution",
       y = "Cumulative Proportion",
       x = "Satisfaction Level")

Reading Cumulative Plots
  • Steeper lines = responses concentrated in that range
  • Higher line at left = more dissatisfied responses
  • Lines that cross = different distribution patterns
  • Gap between lines = difference in satisfaction

Method 3: gglikert (Modern Approach)

Code
# Load survey data with multiple questions
sdat <- base::readRDS("tutorials/dviz/data/sdd.rda")

# Clean column names
colnames(sdat)[3:ncol(sdat)] <- paste0(
  "Q", str_pad(1:10, 2, "left", "0"), ": ",
  colnames(sdat)[3:ncol(sdat)]
) |>
  stringr::str_replace_all("\\.", " ") |>
  stringr::str_squish() |>
  stringr::str_replace_all("$", "?")

# Convert to factors with labels
lbs <- c("Disagree", "Somewhat\nDisagree", "Neutral", 
         "Somewhat\nAgree", "Agree")

survey <- sdat |>
  dplyr::mutate_if(is.character, factor) |>
  dplyr::mutate_if(is.numeric, factor, levels = 1:5, labels = lbs) |>
  drop_na() |>
  as.data.frame()

# Create gglikert plot
survey |>
  dplyr::select(matches("01|02|03|04")) |>
  gglikert(labels_size = 2.5,
           add_labels = FALSE) +
  ggtitle("Survey Responses to Selected Questions") +
  scale_fill_brewer(palette = "RdBu")

Likert Best Practices
  1. Order matters: Keep response scales in order (don’t sort by frequency)
  2. Neutral center: Place neutral/midpoint in the middle
  3. Diverging colors: Use colors that diverge from center (e.g., Red-Blue)
  4. Group facets: Use for comparing sub-groups
  5. Consider n: Show sample sizes when comparing groups

Exercise 3.2: Survey Visualization Challenge

Real-World Application

Imagine you’ve surveyed 100 students about their experience in an online course. Create visualizations to show:

  1. Overall satisfaction distribution (use ldat as an example)
  2. Comparison between different courses
  3. Which visualization would you use in:
    • An academic paper?
    • A presentation to administrators?
    • A quick report to instructors?

Reflect: How does your choice of visualization affect the “story” the data tells?

Pie Charts: Use With Caution

Design Warning

Pie charts are popular but problematic:
- Hard to compare slice sizes
- Difficult to estimate percentages
- Problematic with many categories
- Bar plots almost always work better

When pies might be okay:
- Very few categories (2-3)
- One category is dominant (~50%+)
- Showing parts of a whole is crucial

Here’s how to make one anyway (for comparison):

Code
# Create data for pie chart
piedata <- bdat |>
  dplyr::arrange(desc(DateRedux)) |>
  dplyr::mutate(Position = cumsum(Percent) - 0.5 * Percent)

# Create side-by-side comparison
p1 <- ggplot(bdat, aes("", Percent, fill = DateRedux)) +
  geom_bar(stat = "identity", position = position_dodge(), width = 0.7) +
  scale_fill_manual(values = clrs) +
  theme_minimal() +
  labs(title = "Bar Plot", y = "Percent")

p2 <- ggplot(piedata, aes("", Percent, fill = DateRedux)) +
  geom_bar(stat = "identity", width = 1, color = "white") +
  coord_polar("y", start = 0) +
  scale_fill_manual(values = clrs) +
  theme_void() +
  geom_text(aes(y = Position, label = paste0(Percent, "%")), 
            color = "white", size = 4) +
  labs(title = "Pie Chart")

gridExtra::grid.arrange(p1, p2, nrow = 1)

Which is easier to interpret? Why?

Exercise 3.3: Pie vs. Bar Debate

Critical Thinking

Look at the comparison above.

  1. Without looking at the numbers, which time period has the highest percentage in the pie chart?
  2. Try the same question with the bar plot.
  3. Which differences are easier to see?

Challenge: Find a situation where a pie chart might actually be the better choice. Share your reasoning!


Part 4: Advanced Visualizations

Now that you’ve mastered the basics, let’s explore some specialized and advanced plot types.

Heatmaps: Visualizing Matrices

Heatmaps use color to represent values in a matrix or table.

Code
# Create and scale data
heatdata <- pdat |>
  dplyr::group_by(DateRedux, GenreRedux) |>
  dplyr::summarise(Prepositions = mean(Prepositions)) |>
  tidyr::spread(DateRedux, Prepositions)

heatmx <- as.matrix(heatdata[, 2:6])   # all five time-period columns
rownames(heatmx) <- heatdata$GenreRedux
heatmx <- scale(heatmx)  # Standardize
Code
heatmap(heatmx, 
        scale = "none",           # Already scaled
        col = colorRampPalette(c("blue", "white", "red"))(50),
        margins = c(7, 10))       # Adjust label margins

Reading heatmaps:
- Color intensity: Magnitude of value
- Dendrograms (tree diagrams): Show clustering/similarity
- Rows/columns: Can be reordered to reveal patterns

When to Use Heatmaps
  • Showing patterns in large matrices
  • Gene expression data
  • Correlation matrices
  • Time-series across categories
  • Survey responses across questions

Avoid when:
- Data is sparse (many missing values)
- Categories don’t have natural ordering
- Precise values matter more than patterns
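As a quick illustration of the correlation-matrix use case, here is a minimal sketch using the built-in mtcars data:

Code

```r
# Correlation matrix of a few mtcars variables
cors <- cor(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")])

heatmap(cors,
        scale = "none",   # correlations are already on a common scale
        col   = colorRampPalette(c("blue", "white", "red"))(50),
        symm  = TRUE)
```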

Association Plots: Expected vs. Observed

Association plots show deviations from expected frequencies:

Code
library(vcd)

# Prepare data
assocdata <- pdat |>
  dplyr::mutate(
    GenreRedux = dplyr::case_when(
      GenreRedux == "Conversational" ~ "Conv.",
      GenreRedux == "Religious" ~ "Relig.",
      TRUE ~ GenreRedux
    )
  ) |>
  dplyr::group_by(GenreRedux, DateRedux) |>
  dplyr::summarise(Prepositions = round(mean(Prepositions), 0)) |>
  tidyr::spread(DateRedux, Prepositions)

assocmx <- as.matrix(assocdata[, 2:6])
rownames(assocmx) <- assocdata$GenreRedux
Code
assoc(assocmx, shade = TRUE,
      main = "Association Plot: Genre × Time Period")

Interpreting association plots:
- Above the line: More than expected
- Below the line: Less than expected
- Blue shading: Significantly more than expected
- Red shading: Significantly less than expected
- Bar width: Contribution to chi-square statistic

Mosaic Plots: Proportional Rectangles

Code
mosaic(assocmx, shade = TRUE, legend = TRUE,
       main = "Mosaic Plot: Genre Composition Over Time")

Reading mosaic plots:
- Rectangle size: Proportion of total
- Color: Deviation from expected (like association plots)
- Position: Shows conditional relationships

Mosaic vs. Association Plots

Mosaic plots:
- Show proportions visually through rectangle size
- Better for understanding composition
- Good for presentations

Association plots:
- Emphasize statistical significance
- Better for identifying specific deviations
- Good for detailed analysis

Word Clouds: Visualizing Text

Word clouds show word frequencies. Let’s analyze political speeches:

Code
library(quanteda)
library(quanteda.textplots)

# Load speeches
clinton <- base::readRDS("tutorials/dviz/data/Clinton.rda") |> 
  paste0(collapse = " ")
trump <- base::readRDS("tutorials/dviz/data/Trump.rda") |> 
  paste0(collapse = " ")

# Create corpus
corp_dom <- quanteda::corpus(c(clinton, trump))
attr(corp_dom, "docvars")$Author <- c("Clinton", "Trump")

# Process text
corp_dom <- corp_dom |>
  quanteda::tokens(remove_punct = TRUE) |>
  quanteda::tokens_remove(stopwords("english")) |>
  quanteda::dfm() |>
  quanteda::dfm_group(groups = corp_dom$Author) |>
  quanteda::dfm_trim(min_termfreq = 200, verbose = FALSE)

Simple Word Cloud

Code
corp_dom |>
  quanteda.textplots::textplot_wordcloud(comparison = FALSE,
                                         max_words = 50)

Comparison Cloud

Code
corp_dom |>
  quanteda.textplots::textplot_wordcloud(
    comparison = TRUE,
    max_words = 50,
    color = c("blue", "red")
  )

Word Cloud Limitations

Problems:
- Word sizes are hard to compare precisely
- Common words dominate even after removing stop words
- No context (meaning can be misleading)
- Can misrepresent emphasis

Better for:
- Initial exploration
- Public presentations (engaging but not precise)
- Showing overall themes
- Complementing (not replacing) quantitative analysis

Exercise 4.1: Text Analysis

Interpretation Challenge

Looking at the comparison cloud above:

  1. What themes differentiate Clinton from Trump?
  2. What do the largest words in each color suggest about their campaign focus?
  3. What are the limitations of this visualization?
  4. What additional analyses would you want to do?

Bonus: Research “topic modeling” - how might this provide deeper insights than word clouds?

Flags in Visualizations

Adding country flags can make international comparisons more engaging:

Code
flagsdf <- data.frame(
  Region = c("Australia", "Canada", "Great Britain", "India", 
             "Ireland", "New Zealand", "United States"),
  Percent = c(0.022, 0.017, 0.025, 0.010, 0.019, 0.020, 0.036),
  Kachru = c("Inner circle", "Inner circle", "Inner circle", "Outer circle",
             "Inner circle", "Inner circle", "Inner circle"),
  country = c("au", "ca", "gb", "in", "ie", "nz", "us")
)
Code
flagsdf |>
  ggplot(aes(x = reorder(Region, Percent), 
             y = Percent, 
             country = country,
             fill = Kachru)) +
  geom_bar(stat = "identity") +
  ggflags::geom_flag(size = 5) +
  geom_text(aes(label = scales::percent(Percent, accuracy = 0.1)),
            hjust = -0.3, size = 3) +
  coord_flip(ylim = c(0, 0.045)) +
  scale_fill_manual(values = c("lightblue", "coral")) +
  scale_y_continuous(labels = scales::percent) +
  theme_minimal() +
  labs(x = "", 
       y = "Vulgar Language Percentage",
       title = "Vulgar Language Use by English-Speaking Region",
       fill = "English Type") +
  theme(legend.position = c(0.8, 0.3),
        panel.grid.major = element_blank())

When to Use Flags

Good for:
- International comparisons
- Making data more accessible to general audiences
- Adding visual interest to country-level data

Requirements:
- Need ISO country codes (e.g., “us”, “gb”, “au”)
- Works best with horizontal bar plots
- Don’t overuse - can look unprofessional in some contexts


Part 5: Time Series and Lines

Time series data shows how things change over time. Line graphs are the go-to visualization.

Basic Line Graphs

Code
pdat |>
  dplyr::group_by(DateRedux, GenreRedux) |>
  dplyr::summarise(Frequency = mean(Prepositions)) |>
  ggplot(aes(x = DateRedux, y = Frequency, 
             group = GenreRedux, 
             color = GenreRedux)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +               # Add points at data locations
  scale_color_manual(values = clrs) +
  theme_minimal() +
  labs(title = "Preposition Frequency Over Time by Genre",
       x = "Time Period",
       y = "Mean Frequency (per 1,000 words)",
       color = "Genre")

Line Graph Essentials
  • Points: Show actual data locations
  • Lines: Show trends/connections
  • Group aesthetic: Tells ggplot which points to connect
  • Color: Distinguishes different series

Smoothed Line Graphs

For continuous time variables, smoothing reveals trends:

Code
ggplot(pdat, aes(x = Date, y = Prepositions, 
                 color = GenreRedux, 
                 linetype = GenreRedux)) +
  geom_smooth(se = FALSE, linewidth = 1.2) +
  scale_linetype_manual(
    values = c("solid", "dashed", "dotted", "dotdash", "longdash"),
    name = "Genre"
  ) +
  scale_colour_manual(values = clrs, name = "Genre") +
  theme_bw() +
  theme(legend.position = "top") +
  labs(x = "Year", 
       y = "Relative Frequency\n(per 1,000 words)",
       title = "Smoothed Trends in Preposition Use")

Why smooth?
- Reduces noise from individual data points
- Shows overall trends more clearly
- Uses LOESS (locally weighted smoothing) by default for smaller samples (under ~1,000 points; a GAM otherwise)
- Helpful when you have many data points

Ribbon Plots: Showing Uncertainty

Ribbon plots display ranges (like min/max or confidence intervals):

Code
pdat |>
  dplyr::mutate(DateRedux = as.numeric(DateRedux)) |>
  dplyr::group_by(DateRedux) |>
  dplyr::summarise(
    Mean = mean(Prepositions),
    Min = min(Prepositions),
    Max = max(Prepositions),
    SD = sd(Prepositions)
  ) |>
  ggplot(aes(x = DateRedux, y = Mean)) +
  geom_ribbon(aes(ymin = Min,           # Min-max ribbon (wider; drawn first)
                  ymax = Max), 
              fill = "gray80", 
              alpha = 0.3) +
  geom_ribbon(aes(ymin = Mean - SD,     # ±1 SD ribbon (drawn on top)
                  ymax = Mean + SD), 
              fill = "lightblue", 
              alpha = 0.4) +
  geom_line(linewidth = 1.2, color = "darkblue") +
  scale_x_continuous(breaks = seq_along(names(table(pdat$DateRedux))),
                     labels = names(table(pdat$DateRedux))) +
  theme_minimal() +
  labs(title = "Preposition Frequency: Mean with Variation",
       x = "Time Period",
       y = "Frequency (per 1,000 words)") +
  ggplot2::annotate("text", x = 2.5, y = 180, 
           label = "Gray = Min-Max range", size = 3) +
  ggplot2::annotate("text", x = 2.5, y = 170, 
           label = "Blue = ±1 SD", size = 3)

Ribbon plots are excellent for:
- Showing uncertainty
- Displaying confidence intervals
- Visualizing ranges in forecasts
- Comparing variability across time


Part 6: Specialized Plots

Let’s explore some specialized plot types for specific scenarios.

Balloon Plots

Balloon plots show three variables: two categorical and one continuous.

Code
pdat |>
  dplyr::mutate(DateRedux = factor(DateRedux)) |>
  dplyr::group_by(DateRedux, GenreRedux) |>
  dplyr::summarise(Prepositions = mean(Prepositions)) |>
  ggplot(aes(DateRedux, GenreRedux,
             size = Prepositions,
             fill = GenreRedux)) +
  geom_point(shape = 21, alpha = 0.7) +
  scale_size_area(max_size = 20) +
  scale_fill_manual(values = clrs) +
  theme_minimal() +
  theme(legend.position = "none",
        panel.grid.major = element_line(color = "gray90")) +
  labs(title = "Preposition Frequency: Genre × Time Period",
       x = "Time Period",
       y = "Genre",
       size = "Frequency")

When to use balloon plots:
- Showing three variables simultaneously
- Matrix-style comparisons
- When circle size is intuitive for your audience

Limitations:
- Hard to compare sizes precisely
- Can get crowded with many categories
- Consider a heatmap as an alternative

Dot Plots with Error Bars

Showing means with confidence intervals:

Code
ggplot(pdat, aes(x = reorder(Genre, Prepositions, mean), 
                 y = Prepositions,
                 group = Genre)) +
  stat_summary(fun = mean,               # Plot means
               geom = "point", 
               size = 4,
               aes(color = Genre)) +
  stat_summary(fun.data = mean_cl_boot,  # Bootstrap 95% CI (requires Hmisc)
               geom = "errorbar", 
               width = 0.2,
               linewidth = 1) +
  coord_cartesian(ylim = c(80, 200)) +
  theme_bw(base_size = 12) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none"
  ) +
  labs(x = "", 
       y = "Prepositions (per 1,000 words)",
       title = "Mean Preposition Frequency by Genre",
       subtitle = "Error bars show 95% confidence intervals")

Error Bars vs. Boxplots

Error bars show:
- Specific statistic (mean, median)
- Specific uncertainty measure (SE, CI, SD)
- Cleaner look for publications

Boxplots show:
- More distributional information
- Quartiles and outliers
- Better for detecting skewness

Exercise 6.1: Comparison Challenge

Statistical Visualization

Create two plots of Prepositions by GenreRedux:
1. A dot plot with error bars (use code above)
2. A boxplot

Compare:
- What does each tell you?
- Which shows outliers better?
- Which would you use to claim “Genre X has higher frequency than Genre Y”?
- When would you choose each?

Comparative Bar Plots with Negatives

Sometimes you want to show deviation from a reference:

Code
# Create example data
Test1 <- c(11.2, 13.5, 200, 185, 1.3, 3.5)
Test2 <- c(12.2, 14.7, 210, 175, 1.9, 3.0)
Test3 <- c(13.2, 15.1, 177, 173, 2.4, 2.9)

testdata <- data.frame(Test1, Test2, Test3)
rownames(testdata) <- c(
  "Feature1_Student", "Feature1_Reference",
  "Feature2_Student", "Feature2_Reference",
  "Feature3_Student", "Feature3_Reference"
)

# Calculate deviations
FeatureA <- t(testdata[1, ] - testdata[2, ])
FeatureB <- t(testdata[3, ] - testdata[4, ])
FeatureC <- t(testdata[5, ] - testdata[6, ])

plottable <- data.frame(
  Test = rep(rownames(FeatureA), 3),
  Value = c(FeatureA, FeatureB, FeatureC),
  Feature = rep(c("FeatureA", "FeatureB", "FeatureC"), each = 3)
)

# Plot divergence
ggplot(plottable, aes(Test, Value, fill = Test)) +
  facet_grid(vars(Feature), scales = "free_y") +
  geom_col() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  scale_fill_manual(values = clrs[1:3]) +
  theme_bw() +
  theme(legend.position = "none") +
  labs(x = "Test",
       y = "Deviation from Reference",
       title = "Learner Performance: Deviation from Native Speakers",
       subtitle = "Positive = Above reference, Negative = Below reference")

Use cases:
- Language learner vs. native speaker comparisons
- Treatment vs. control groups
- Actual vs. expected values
- Change from baseline


Part 7: Publication-Ready Plots

Let’s pull everything together to create publication-quality visualizations.

The Anatomy of a Perfect Plot

A publication-ready plot needs:

  1. Clear title and subtitle
  2. Axis labels with units
  3. Legend (when needed)
  4. Appropriate theme
  5. Readable fonts
  6. Colorblind-friendly palette
  7. Proper sizing
  8. Citation/source (when relevant)

Example: Building a Complete Plot

Code
pdat |>
  dplyr::group_by(DateRedux, GenreRedux) |>
  dplyr::summarise(
    Mean = mean(Prepositions),
    SE = sd(Prepositions) / sqrt(n()),
    N = n()
  ) |>
  ggplot(aes(x = DateRedux, y = Mean, 
             color = GenreRedux, 
             group = GenreRedux)) +
  # Data layers
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = Mean - SE, ymax = Mean + SE),
                width = 0.2, linewidth = 0.8) +
  # Scales
  scale_color_manual(
    name = "Text Genre",
    values = clrs,
    labels = c("Conversational", "Fiction", "Legal", 
               "Non-fiction", "Religious")
  ) +
  scale_y_continuous(
    breaks = seq(100, 200, 20),
    limits = c(100, 200)
  ) +
  # Theme and labels
  theme_bw(base_size = 14) +
  theme(
    legend.position = "inside",
    legend.position.inside = c(0.15, 0.65),
    legend.background = element_rect(fill = "white", color = "black"),
    panel.grid.minor = element_blank(),
    plot.title = element_text(face = "bold", size = 16),
    plot.subtitle = element_text(size = 12, color = "gray30"),
    plot.caption = element_text(size = 10, hjust = 0)
  ) +
  labs(
    title = "Historical Trends in Preposition Usage",
    subtitle = "Analysis of English texts from 1150-1913",
    x = "Time Period",
    y = "Mean Frequency (per 1,000 words)",
    caption = "Source: Penn Parsed Corpora of Historical English (PPC)\nError bars show ±1 SE"
  )

Saving High-Quality Figures

Code
# ggsave() saves the most recently displayed plot unless plot = is given

# Save for publication (vector format; scales without pixelation)
ggsave("preposition_trends.pdf",
       width = 10, height = 6)

# Save for presentations (high-resolution raster)
ggsave("preposition_trends.png",
       width = 10, height = 6, dpi = 300)

# Save for web (smaller file)
ggsave("preposition_trends_web.png",
       width = 10, height = 6, dpi = 150)

File Format Guide

PNG - Best for:
- Web use
- Presentations
- Figures with photos or complex gradients
- When file size matters

PDF - Best for:
- Publications (journals often require vector)
- Posters
- When scaling is needed
- Print materials

TIFF - Best for:
- Some journal requirements
- Archival purposes

DPI (resolution):
- Web: 72-150 dpi
- Presentations: 150 dpi
- Print: 300 dpi
- Posters: 600 dpi

Color Accessibility

Making plots accessible to colorblind readers:

Code
library(viridis)

# Original plot with problematic colors
p1 <- pdat |>
  dplyr::group_by(DateRedux, GenreRedux) |>
  dplyr::summarise(Mean = mean(Prepositions)) |>
  ggplot(aes(DateRedux, Mean, fill = GenreRedux)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("red", "green", "blue", "yellow", "purple")) +
  ggtitle("Problematic Colors") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

# Improved with viridis palette
p2 <- pdat |>
  dplyr::group_by(DateRedux, GenreRedux) |>
  dplyr::summarise(Mean = mean(Prepositions)) |>
  ggplot(aes(DateRedux, Mean, fill = GenreRedux)) +
  geom_col(position = "dodge") +
  scale_fill_viridis_d() +
  ggtitle("Colorblind-Friendly") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

gridExtra::grid.arrange(p1, p2, nrow = 1)

Colorblind-friendly palettes:
- scale_color_viridis_d() / scale_fill_viridis_d()
- scale_color_brewer() with “Set2”, “Dark2”, or “Paired”
- ColorBrewer palettes (many are colorblind-safe)

Exercise 7.1: Publication Polish

Final Project

Create a publication-ready visualization:

  1. Choose any relationship in the data

  2. Create a complete plot with:

    • Informative title and subtitle
    • Proper axis labels with units
    • A colorblind-friendly palette
    • Appropriate theme
    • Source citation
    • Legend if needed
  3. Save it in three formats (PNG, PDF, web-optimized PNG)

  4. Write a 2-3 sentence caption that could accompany the figure in a paper

Peer review: Exchange with a colleague - is your plot self-explanatory?


Part 8: Choosing the Right Plot

The hardest part of data visualization is choosing which plot to make. Let’s develop a decision framework.

Decision Tree

Start Here: What’s Your Data Structure?

1. One Continuous Variable

Goal: Show distribution

  • Few data points (<50): Dot plot, strip plot
  • Medium-sized data (50-500): Histogram, density plot
  • Large data (500+): Density plot, violin plot
  • Want summary statistics: Boxplot

2. One Continuous + One Categorical

Goal: Compare groups

  • Compare distributions: Boxplot, violin plot, ridge plot
  • Compare means: Dot plot with error bars
  • Show all data: Jittered points, beeswarm plot

3. Two Continuous Variables

Goal: Show relationship

  • Basic relationship: Scatter plot
  • Many points (overlap): Hex plot, 2D density
  • Add trend: Add geom_smooth()
  • Compare groups: Color by group, facet by group

4. Two Categorical Variables

Goal: Show associations

  • Frequencies: Bar plot (grouped or stacked)
  • Proportions: 100% stacked bar, mosaic plot
  • Statistical test: Association plot

5. Time Series

Goal: Show change over time

  • Discrete time points: Line graph with points
  • Continuous time: Smoothed line, ribbon plot
  • Multiple series: Colored lines, small multiples
  • Uncertainty: Ribbon plot, error bars

6. Three+ Variables

Goal: Show multivariate relationships

  • Third variable categorical: Color/shape, facets
  • Third variable continuous: Color gradient, bubble size
  • Many variables: Heatmap, parallel coordinates
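
As a minimal sketch of the third-variable options, here a continuous third variable (city mileage, cty) is mapped to a colour gradient using ggplot2's built-in mpg data:

```r
library(ggplot2)

# Third continuous variable (cty) mapped to a colour gradient;
# a fourth could be mapped to bubble size via size =
ggplot(mpg, aes(x = displ, y = hwy, color = cty)) +
  geom_point(size = 3, alpha = 0.7) +
  scale_color_viridis_c(name = "City mpg") +
  theme_minimal() +
  labs(x = "Engine displacement (litres)",
       y = "Highway miles per gallon")
```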

Common Scenarios and Solutions

Scenario 1: Survey Results

Data: Likert scale responses from 5 groups

Options:
1. gglikert plot (best for multiple questions)
2. Stacked bar chart (100% for proportions)
3. Faceted bar charts (best for comparing specific responses)

Choose based on:
- Number of questions (many → gglikert)
- Focus on specific categories (faceted bars)
- Showing overall sentiment (stacked bars)

Scenario 2: Experimental Results

Data: Measurements from treatment and control groups

Options:
1. Boxplots (show distributions + outliers)
2. Violin plots (show distribution shape)
3. Bar plot with error bars (show means + uncertainty)

Choose based on:
- Sample size (small → dot plot, large → violin)
- Presence of outliers (boxplot shows these)
- Simplicity needed (bar + error = simplest)

Scenario 3: Geographic Data

Data: Values across countries/regions

Options:
1. Map (when geography matters)
2. Bar plot with flags (when ranking matters)
3. Dot plot (when precision matters)

Choose based on:
- Audience familiarity with geography
- Whether spatial patterns matter
- Number of regions (maps get cluttered when there are many)

Exercise 8.1: Plot Selection Challenge

Real-World Scenarios

For each scenario, choose the best plot type and explain why:

Scenario A: You have test scores (0-100) for students in 4 different teaching methods. You want to know if methods differ significantly.

Scenario B: You’ve measured reaction times (milliseconds) in 20 trials for each of 50 participants.

Scenario C: You surveyed 200 people about their agreement (5-point scale) with 10 statements about climate change.

Scenario D: You have daily temperature readings for 5 cities over one year.

For each:
1. What plot type would you use?
2. What alternatives did you consider?
3. What would make you change your choice?

Common Mistakes to Avoid

❌ Mistake 1: 3D Charts

Problem: Hard to read, distort data

Code
# DON'T DO THIS
# 3D plots are almost never appropriate for data visualization

Instead: Use 2D charts with proper grouping/faceting

❌ Mistake 2: Dual Y-Axes

Problem: Can be misleading, hard to interpret

Instead:
- Facet plots (separate panels)
- Normalize to same scale
- Use secondary metric only if essential
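
A sketch of the first two alternatives, using made-up temperature and rainfall values: standardize each series and facet, so every panel keeps a single, honest axis:

```r
library(ggplot2)
library(tidyr)
library(dplyr)

# Hypothetical data: two series on very different scales
clim <- data.frame(
  Year = 2000:2009,
  Temperature = c(14.1, 14.3, 14.2, 14.5, 14.4, 14.6, 14.5, 14.7, 14.8, 14.9),
  Rainfall = c(820, 790, 805, 760, 780, 740, 755, 730, 720, 700)
)

clim |>
  pivot_longer(c(Temperature, Rainfall),
               names_to = "Series", values_to = "Value") |>
  group_by(Series) |>
  mutate(Scaled = as.numeric(scale(Value))) |>  # z-scores per series
  ggplot(aes(Year, Scaled)) +
  geom_line(linewidth = 1) +
  facet_wrap(vars(Series), ncol = 1) +          # one panel per series
  theme_minimal() +
  labs(y = "Standardized value (z-score)")
```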

❌ Mistake 3: Too Many Colors

Problem: Confusing, hard to distinguish

Instead:
- Limit to 5-7 colors
- Use ColorBrewer palettes
- Consider faceting instead

❌ Mistake 4: Truncated Y-Axis (Bar Plots)

Problem: Exaggerates differences

Rule: Bar plots should always start at zero

Exception: Dot plots with error bars can use truncated axes
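
A quick sketch with made-up group means: geom_col() keeps the zero baseline by default, while a dot plot can zoom with coord_cartesian(), which crops the view without dropping data (unlike setting limits in scale_y_continuous(), which silently discards points outside the range):

```r
library(ggplot2)

means <- data.frame(Group = c("A", "B", "C"),
                    Mean = c(148, 152, 155))  # hypothetical values

# Bar plot: the y-axis starts at zero (the default) - keep it that way
ggplot(means, aes(Group, Mean)) +
  geom_col() +
  theme_minimal()

# Dot plot: zooming in is acceptable here
ggplot(means, aes(Group, Mean)) +
  geom_point(size = 4) +
  coord_cartesian(ylim = c(140, 160)) +
  theme_minimal()
```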

❌ Mistake 5: Chartjunk

Problem: Decoration distracts from data

Avoid:
- Unnecessary grid lines
- Decorative backgrounds
- 3D effects
- Shadows and gradients (usually)

Instead: Use theme_minimal() or theme_bw() as starting points

The Grammar of Graphics Framework

ggplot2 is based on “The Grammar of Graphics” - understanding this helps you think about plots systematically.

Every plot has:

  1. Data - What you’re visualizing
  2. Aesthetics (aes) - What goes where (x, y, color, size, etc.)
  3. Geometries (geom) - How to display it (points, lines, bars, etc.)
  4. Scales - How aesthetics map to visual properties
  5. Facets - Subplots
  6. Themes - Non-data visual elements

Building blocks:

Code
ggplot(data = <DATA>) +
  aes(x = <X>, y = <Y>, color = <GROUP>) +  # Aesthetics
  geom_<TYPE>() +                            # Geometry
  scale_<AESTHETIC>_<TYPE>() +               # Scales
  facet_<TYPE>(vars(<VARIABLE>)) +           # Facets
  theme_<STYLE>() +                          # Theme
  labs(title = <TITLE>, ...)                 # Labels

This modular approach lets you build any plot by combining these components!
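
For instance, the template can be filled in with ggplot2's built-in mpg data:

```r
library(ggplot2)

ggplot(data = mpg) +
  aes(x = displ, y = hwy, color = drv) +    # Aesthetics
  geom_point() +                            # Geometry
  scale_color_brewer(palette = "Dark2") +   # Scale
  facet_wrap(vars(class)) +                 # Facets
  theme_bw() +                              # Theme
  labs(title = "Highway Mileage by Engine Displacement",
       x = "Displacement (litres)",
       y = "Highway mpg",
       color = "Drive")                     # Labels
```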


Final Challenge: Capstone Project

Comprehensive Data Visualization Project

You’ve learned all the essential techniques. Now put them together!

Your Task

Create a complete data story using the pdat dataset (or your own data). Your project should include:

Required Components:

  1. At least 3 different plot types from different sections:
    • One showing distributions
    • One showing relationships
    • One showing categorical comparisons
  2. Publication-ready quality:
    • Proper titles, labels, and captions
    • Colorblind-friendly palette
    • Appropriate themes
    • Clear legends
  3. A narrative:
    • 2-3 paragraph introduction explaining your question
    • Transition text between plots explaining what each shows
    • 2-3 paragraph conclusion summarizing findings
  4. Technical elements:
    • At least one faceted plot
    • At least one customized plot (colors, themes, labels)
    • Proper use of aesthetics (color, shape, size)

Example Questions to Explore

  • How has language use evolved across different genres over time?
  • Are there regional differences in writing styles?
  • What patterns exist in the data that might surprise a linguist?
  • Can you predict time period based on linguistic features?

Deliverables

  1. R Markdown document with all code and narrative
  2. 3-5 high-quality figures saved as PNG (300 dpi)
  3. One “highlight figure” that tells your main story

Evaluation Criteria

Your project will be strong if it:
- ✅ Chooses appropriate plot types for each question
- ✅ Uses visualization best practices (clear labels, readable fonts, etc.)
- ✅ Tells a coherent story with the data
- ✅ Shows technical mastery of ggplot2
- ✅ Includes thoughtful interpretation of results
- ✅ Is reproducible (all code runs without errors)

Bonus points for:
- Creative combinations of techniques
- Particularly insightful findings
- Exceptional visual design
- Going beyond the tutorial examples


Resources and Next Steps

Online Resources

Interactive Learning:
- R Graph Gallery - Hundreds of examples with code
- Data to Viz - Decision tree for choosing plots
- From Data to Viz - Interactive explorer

Reference:
- ggplot2 documentation
- R Color Reference
- ColorBrewer - Choose palettes

Advanced Topics:
- Patchwork - Combining multiple plots
- gganimate - Animated visualizations
- plotly - Interactive plots
- rayshader - 3D visualizations (when appropriate!)

Cheat Sheets

Download and print these:
- ggplot2 cheat sheet
- RStudio IDE cheat sheet

Common Problems and Solutions

“My plot is too crowded”

Solutions:
- Facet into multiple panels
- Filter to top N categories
- Use color to highlight key groups
- Try a different plot type (e.g., heatmap instead of scatter)

“Colors look different in different programs”

Solutions:
- Use colorblind-safe palettes
- Test in target environment
- Save as PDF (preserves colors better)
- Specify colors explicitly with hex codes

“Text overlaps in my plot”

Solutions:
- Rotate labels: theme(axis.text.x = element_text(angle = 45, hjust = 1))
- Use ggrepel::geom_text_repel()
- Reduce number of labels
- Increase plot size
- Abbreviate labels
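
For example, ggrepel (install with install.packages("ggrepel") if needed) nudges labels apart automatically, sketched here on the built-in mtcars data:

```r
library(ggplot2)

# Labels repel each other and the points instead of overlapping
ggplot(mtcars, aes(wt, mpg, label = rownames(mtcars))) +
  geom_point() +
  ggrepel::geom_text_repel(size = 3, max.overlaps = 15) +
  theme_minimal() +
  labs(x = "Weight (1,000 lbs)", y = "Miles per gallon")
```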

“Error: object not found”

Solutions:
- Check spelling of variable names
- Ensure data is loaded
- Check if library is loaded
- Use str(data) to see variable names

“Plot looks pixelated”

Solutions:
- Increase DPI: ggsave(..., dpi = 300)
- Save as PDF (vector format)
- Increase figure size
- Avoid resizing after saving

Where to Get Help

  1. Stack Overflow: Tag your question with [r] and [ggplot2]
  2. RStudio Community: https://community.rstudio.com/
  3. R for Data Science Slack: https://www.rfordatasci.com/
  4. Twitter #rstats: Active, helpful community

Practice Datasets

To continue learning, try these datasets:

Built into R:
- mpg - Fuel economy data
- diamonds - Diamond prices and properties
- economics - US economic time series
- midwest - Demographic data

From packages:
- gapminder - Global health and wealth
- nycflights13 - Flight data
- fivethirtyeight - Data from news articles
- palmerpenguins - Alternative to iris dataset

Your Learning Path

Beginner → Intermediate:
1. ✅ Master basic geoms (point, line, bar, box)
2. ✅ Understand aesthetics and mapping
3. ✅ Learn faceting
4. ✅ Customize themes
5. ⬜ Combine multiple plots (patchwork)
6. ⬜ Create custom themes
7. ⬜ Build functions for repeated plots

Intermediate → Advanced:
1. ⬜ Master scales and coordinates
2. ⬜ Custom annotations
3. ⬜ Statistical transformations
4. ⬜ Extension packages (gganimate, ggraph, etc.)
5. ⬜ Interactive visualizations (plotly)
6. ⬜ Creating your own geoms
7. ⬜ Publication-ready figure workflows


Citation & Session Info

Schweinberger, Martin. 2025. Mastering Data Visualization with R. Brisbane: The University of Queensland. url: https://ladal.edu.au/tutorials/dviz/dviz.html (Version 2025.02.07).

@manual{schweinberger2025dviz,
  author = {Schweinberger, Martin},
  title = {Mastering Data Visualization with R},
  note = {https://ladal.edu.au/tutorials/dviz/dviz.html},
  year = {2025},
  organization = {The University of Queensland, School of Languages and Cultures},
  address = {Brisbane},
  edition = {2025.02.07}
}

Session Information

Code
sessionInfo()
R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Australia/Brisbane
tzcode source: internal

attached base packages:
[1] grid      stats     graphics  grDevices datasets  utils     methods  
[8] base     

other attached packages:
 [1] viridis_0.6.5           viridisLite_0.4.2       quanteda.textplots_0.95
 [4] quanteda_4.2.0          scales_1.3.0            ggstats_0.10.0         
 [7] ggflags_0.0.4           ggstatsplot_0.13.0      EnvStats_3.0.0         
[10] gridExtra_2.3           vip_0.4.1               PMCMRplus_1.9.12       
[13] rstantools_2.4.0        hexbin_1.28.5           flextable_0.9.7        
[16] tidyr_1.3.1             ggridges_0.5.6          tm_0.7-16              
[19] NLP_0.3-2               vcd_1.4-13              likert_1.3.5           
[22] xtable_1.8-4            ggplot2_3.5.1           stringr_1.5.1          
[25] dplyr_1.1.4            

loaded via a namespace (and not attached):
  [1] rstudioapi_0.17.1       jsonlite_1.9.0          datawizard_1.0.0       
  [4] correlation_0.8.6       magrittr_2.0.3          TH.data_1.1-3          
  [7] estimability_1.5.1      SuppDists_1.1-9.8       farver_2.1.2           
 [10] rmarkdown_2.30          ragg_1.3.3              vctrs_0.6.5            
 [13] memoise_2.0.1           paletteer_1.6.0         askpass_1.2.1          
 [16] base64enc_0.1-6         effectsize_1.0.0        htmltools_0.5.9        
 [19] BWStest_0.2.3           Formula_1.2-5           htmlwidgets_1.6.4      
 [22] plyr_1.8.9              sandwich_3.1-1          emmeans_1.10.7         
 [25] zoo_1.8-13              cachem_1.1.0            uuid_1.2-1             
 [28] lifecycle_1.0.4         iterators_1.0.14        pkgconfig_2.0.3        
 [31] Matrix_1.7-2            R6_2.6.1                fastmap_1.2.0          
 [34] digest_0.6.39           colorspace_2.1-1        rematch2_2.1.2         
 [37] patchwork_1.3.0         textshaping_1.0.0       Hmisc_5.2-2            
 [40] labeling_0.4.3          compiler_4.4.2          fontquiver_0.2.1       
 [43] withr_3.0.2             backports_1.5.0         htmlTable_2.4.3        
 [46] psych_2.4.12            MASS_7.3-61             openssl_2.3.2          
 [49] tools_4.4.2             foreign_0.8-87          lmtest_0.9-40          
 [52] stopwords_2.3           zip_2.3.2               statsExpressions_1.6.2 
 [55] nnet_7.3-19             glue_1.8.0              nlme_3.1-166           
 [58] checkmate_2.3.2         cluster_2.1.6           reshape2_1.4.4         
 [61] generics_0.1.3          gtable_0.3.6            data.table_1.17.0      
 [64] xml2_1.3.6              foreach_1.5.2           pillar_1.10.1          
 [67] splines_4.4.2           lattice_0.22-6          renv_1.1.1             
 [70] survival_3.7-0          gmp_0.7-5               tidyselect_1.2.1       
 [73] fontLiberation_0.1.0    knitr_1.51              fontBitstreamVera_0.1.1
 [76] xfun_0.56               stringi_1.8.4           yaml_2.3.10            
 [79] evaluate_1.0.3          codetools_0.2-20        kSamples_1.2-10        
 [82] officer_0.6.7           gdtools_0.4.1           tibble_3.2.1           
 [85] multcompView_0.1-10     cli_3.6.4               RcppParallel_5.1.10    
 [88] rpart_4.1.23            parameters_0.24.1       systemfonts_1.2.1      
 [91] munsell_0.5.1           Rcpp_1.0.14             zeallot_0.1.0          
 [94] coda_0.19-4.1           parallel_4.4.2          bayestestR_0.15.2      
 [97] Rmpfr_1.0-0             mvtnorm_1.3-3           slam_0.1-55            
[100] insight_1.0.2           purrr_1.0.4             rlang_1.1.7            
[103] fastmatch_1.1-6         multcomp_1.4-28         mnormt_2.1.1           

Acknowledgments

This tutorial builds on the excellent work of the R and tidyverse communities. Special thanks to:

  • Hadley Wickham for creating ggplot2
  • The RStudio team for tools and resources
  • All package authors cited throughout
  • The LADAL team for supporting this tutorial



Quick Reference Tables

Common Geoms Reference

Geom                 Use For         Example
geom_point()         Scatter plots   Relationship between 2 continuous variables
geom_line()          Line graphs     Time series, trends
geom_bar()           Bar plots       Categorical frequencies
geom_boxplot()       Boxplots        Distribution summaries
geom_violin()        Violin plots    Distribution shapes
geom_histogram()     Histograms      Single-variable distributions
geom_density()       Density plots   Smooth distributions
geom_smooth()        Trend lines     Adding regression/smoothing
geom_errorbar()      Error bars      Showing uncertainty
geom_tile()          Heatmaps        Matrix visualizations
geom_hex()           Hex bins        Large scatter plots
geom_density_2d()    2D density      Concentration in 2D

Common Aesthetics

Aesthetic   Controls            Example Variables
x           X-axis position     Continuous or categorical
y           Y-axis position     Continuous or categorical
color       Border/line color   Groups, categories
fill        Fill color          Groups (for bars, boxes, etc.)
size        Point/line size     Continuous variables
shape       Point shape         Categories (max ~6)
alpha       Transparency        Continuous (0-1)
linetype    Line type           Categories

Common Themes

Theme             Description
theme_bw()        Black and white, minimal
theme_minimal()   Minimal theme, no background
theme_classic()   Classic look, axis lines
theme_void()      Empty theme
theme_dark()      Dark background
theme_grey()      Default ggplot2 theme

Position Adjustments

Position              Use For
position_dodge()      Side-by-side bars
position_stack()      Stacked bars/areas
position_fill()       100% stacked
position_jitter()     Avoid overplotting
position_identity()   Use exact values

Remember: The best visualization is the one that clearly communicates your message to your audience! 📊